Detection of ChatGPT fake science with the xFakeSci learning algorithm

Generative AI tools exemplified by ChatGPT are becoming a new reality. This study is motivated by the premise that “AI generated content may exhibit a distinctive behavior that can be separated from scientific articles”. In this study, we show how articles can be generated using means of prompt engineering for various diseases and conditions. We then show how we tested this premise in two phases and prove its validity. Subsequently, we introduce xFakeSci, a novel learning algorithm, that is capable of distinguishing ChatGPT-generated articles from publications produced by scientists. The algorithm is trained using network models driven from both sources. To mitigate overfitting issues, we incorporated a calibration step that is built upon data-driven heuristics, including proximity and ratios. Specifically, from a total of a 3952 fake articles for three different medical conditions, the algorithm was trained using only 100 articles, but calibrated using folds of 100 articles. As for the classification step, it was performed using 300 articles per condition. The actual label steps took place against an equal mix of 50 generated articles and 50 authentic PubMed abstracts. The testing also spanned publication periods from 2010 to 2024 and encompassed research on three distinct diseases: cancer, depression, and Alzheimer’s. Further, we evaluated the accuracy of the xFakeSci algorithm against some of the classical data mining algorithms (e.g., Support Vector Machines, Regression, and Naive Bayes). The xFakeSci algorithm achieved F1 scores ranging from 80 to 94%, outperforming common data mining algorithms, which scored F1 values between 38 and 52%. We attribute the noticeable difference to the introduction of calibration and a proximity distance heuristic, which underscores this promising performance. Indeed, the prediction of fake science generated by ChatGPT presents a considerable challenge. Nonetheless, the introduction of the xFakeSci algorithm is a significant step on the way to combating fake science.


Introduction
With Large Language Models (LLMs) and generative AI tools (e.g., ChatGPT) 1 becoming a new reality, our world finds itself in a state of controversy.On one hand, there exists a camp of optimists who perceive their potential and seek to harness them.On the other hand, there are doubters who remain skeptical, seeking validation and further assessments to discern how this new paradigm will impact our lives.This division provides strong motivation for this study and catalyzes efforts towards providing a tool that assesses the capability of generating fake science by ChatGPT.Undoubtedly, real science, documented in scientific

Phase I: Analysis of Topological Properties of Network Training Models
We constructed two types of network training models: one derived from content generated through prompt-engineering with ChatGPT, and the other from PubMed abstracts.We examined the structural properties of these network models in terms of the number of nodes and edges.These analyses were conducted within the contexts of three diseases: Alzheimer's, cancer, and depression.The node counts computed from ChatGPT training models were 519, 559, and 577, respectively.In contrast, the number of nodes generated from scientific publications varied across different time periods: for the years 2010-2014, it was 742, 755, and 801; for 2015-2019, it was 774, 828, and 755; and for 2020-2024, it was 817, 817, and 790.Regarding edge counts, ChatGPT training models exhibited 1194, 1050, and 1108 edges, whereas publication network models produced 861, 803, and 878 edges for the years 2010-2014; 940, 977, and 826 for 2015-2019; and 958, 1030, and 809 for 2020-2024.
These findings, presented in Table 3, suggest that ChatGPT-generated datasets generally have fewer nodes compared to scientific articles.However, our analysis also revealed that ChatGPT network models tend to have a higher number of edges relative to publication datasets.This observation is visually depicted in Figure 1, which highlights the strikingly lower node-to-edge ratios of ChatGPT models compared to network models derived from scientific articles.

Phase II: Further Testing the Distinctive Behavior in ChatGPT-Generated Documents
To further investigate the premise, we conducted a test to analyze the mean ratios of contributing bigrams extracted from k-Folds against the document word count.This analysis aimed to establish a baseline for assessing the contribution of bigrams to the overall content structure.The results revealed a consistent pattern across all three disease datasets.Specifically, ChatGPT-generated datasets exhibited significantly higher ratios than their scientific publication counterparts in each of the k-Folds used.For instance, in the Alzheimer's disease dataset, ChatGPT scores were (0.27, 0.30, 0.30, 0.28, 0.28, 0.29), while scientific publications from 2010-2014 scored (0.16, 0.17, 0.16, 0.16, 0.17, 0.16), for 2015-2019 (0.15, 0.16, 0.15, 0.16, 0.14, 0.15), and for 2020-2024 (0.15, 0.15, 0.14, 0.15, 0.14, 0.14).These findings are consistent across the other two diseases, as evident in Table 4. Figures 2 and 3 clearly demonstrate that the k-Folds ratios calculated from ChatGPT-generated data are significantly higher than those derived from scientific publications across different years and scopes.They further illustrate a similar pattern for the cancer and depression datasets.This evidence reinforces the notion that ChatGPT-generated content may exhibit distinct characteristics compared to scientific articles.

Outcome of Label Prediction of Multi-Mode Classification Experiments
To establish confidence in our method and ensure consistent performance of the xFakeSci algorithm, we conducted two types of experiments in the subject area of three different diseases.Additionally, we performed experiments to evaluate whether the year of publication plays a role in class prediction.This section presents the outcomes of experiments utilizing ChatGPT-generated text obtained algorithmically using ChatGPT prompt-engineering, as outlined in Algorithm 1, and scientific publications retrieved from the PubMed web portal 37 related to the Alzheimer's, cancer, and depression diseases.
Here, we present the results of multi-mode experiments, where xFakeSci was trained using a combination of ChatGPT and PubMed abstracts and evaluated on a dataset of unseen documents from all three diseases.Specifically, we trained xFakeSci using an equal-sized dataset of ChatGPT-generated and PubMed abstracts.Then, we calibrated the algorithm using the exact number of k-Folds for each disease.For the PubMed dataset, we used abstracts of articles published between 2020 and 2024.
For each disease, we tested xFakeSci on 100 articles, comprising 50 PubMed abstracts and 50 ChatGPT-generated documents.Table 6 summarizes the performance of xFakeSci in this mode, capturing the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) from both the publication and ChatGPT test items.Using the F1 measure, we note that xFakeSci scored 80%, 91%, and 89% for depression, cancer, and Alzheimer's, respectively.Figure 4 provides a comprehensive analysis of the classification results.
Table 6 demonstrates that xFakeSci detected all 50 PubMed publications for each disease (TP=50).Additionally, we observed that the algorithm identified the ChatGPT-generated documents to varying extents (TN=25, 41, 38) for depression, cancer, and Alzheimer's, respectively.It remains concerning that ChatGPT is classified as PubMed with (FP=25), indicating that 50% of the test documents are misclassified as real publications.Further research is needed to investigate and improve performance.

Outcome of Publication Period as a Factor for the Multi-Mode Classification Experiments
The purpose of this section is to test whether the publication period is a factor in making predictions and assigning labels to a dataset of documents with mixed classes (ChatGPT vs PubMed articles).Here, we use the F1 metric as a measure to present our results and explain the associated performance of our algorithm (captured in Table 8).For each disease dataset extracted from three periods (2020-2024; 2015-2019; 2010-2014), we computed the F1 score.
For the cancer disease dataset, the F1 scores recorded were 91%, 92%, and 94% for the three different periods.These scores show a consistent improvement in predicting the labels with older publication datasets, from 2010 to the present.

3/18
For the depression disease dataset, the F1 score remained constant over time at 80%, indicating that the pattern did not show deterioration over time.
For the Alzheimer's disease dataset, the scores showed a slight improvement from 89% (in 2020-2024) to 90% in (2015-2019), but it dropped back to 89% for the (2010-2014) disease dataset.While the pattern of prediction improvements did not hold as we analyzed older publications, the score did not degrade below the F1 score of 89% for (2020-2024).

Outcome of xFakeSci Performance Analysis against other Data Mining Algorithms
We compared the performance of xFakeSci against some of the most common state-of-the-art algorithms.Specifically, we conducted various performance evaluation experiments against the following algorithms: (1) Naive Bayes, (2) Support Vector Machine (SVM), (3) Linear Support Vector Machine (SVM), and (4) Logistic Regression.Some of these algorithms are listed among the Top-10 Data Mining Algorithms 38 .To establish fairness, we trained each of the algorithms using the exact training dataset used against xFakeSci in multi-mode (where we train and test with mixed datasets).For training, we used the first 100 PubMed abstracts and the first 100 documents of the ChatGPT-generated dataset.As for testing, we used a combination of 50 PubMed abstracts followed by 50 ChatGPT-generated documents.
Each algorithm was used as a blackbox and received input from two sources (training and test), and in turn, produced a detailed analysis in the form of (TP, TN, FP, FN).Using these metrics, the F1 score was computed accordingly.Figure 6 visually depicts the F1 scores observed over time for the 5 algorithms (including xFakeSci) in three different diseases, presented by sub-figures, one for each disease.Each sub-figure shows stacked bars for the three periods of publications, and each bar represents the F1 score resulting from a given algorithm.Table 7 captures the performance analysis of the xFakeSci algorithm against other classical data mining algorithms (Naive Bayes, Linear SVM, Classical SVM, and Logistic Regression).The performance is presented using the F1 metric for publications spanning between 2020 and 2024.The table shows that xFakeSci scores ranged from 80% to 91%, while those of the data mining algorithms fluctuated between 43% and 52%.
Moving to the period of 2014 to 2019, xFakeSci demonstrated F1 scores ranging from 80% to 92%, compared to other data mining algorithms which exhibited F1 scores in the range of 43% to 51%.Lastly, during the period of 2010 to 2014, the F1 scores achieved by xFakeSci fluctuated between 80% and 94%, whereas the F1 scores of the other data mining algorithms were recorded between 38% and 52% as shown in Table 8. Figure 5 shows a screenshot providing evidence of achieving 91% accuracy measured by the F1 metric, while the F1 scores of the other data mining algorithms fluctuated from 43% to 51%.All three sub-figures show a consistent pattern where xFakeSci clearly outperforms all the other four algorithms.The F1 score is calculated using Equation 1, and the analysis was done using the scikit-learn library 39 .

Data Collection
We compiled two distinct types of datasets for the study: (1) Literature dataset: To establish a baseline for comparison and train the xFakeSci algorithm, we utilized the PubMed archive to retrieve scientific articles.We employed three search queries: (a) "Alzheimer's disease and co-morbidities," which we generated 1196 JSON records, (b) "cancer and co-morbidities", generated 1243 JSON records, and (c) "depression" which generated 1513 JSON records.To assess the influence of publication year on fake science detection, we conducted these searches at five-year intervals, resulting in three distinct datasets for each disease (2010-2024).( 2) ChatGPT-generated dataset: We obtained this dataset by programmatically prompting the ChatGPT API (version: 3.5, model: "gpt-3.5-turbo-16k") to generate simulated articles.The prompt-engineering process comprises two primary components: • Prompt Engineering ChatGPT for Simulated Article Generation: We employed ChatGPT, a generative AI tool, for generating simulated articles.We implemented a prompt-engineering technique to guide the process of text generation in three distinct diseases: Alzheimer's, cancer, and depression.
• Predicting Fake Science: We devised a network-centric learning algorithm, called xFakeSci, which we trained on text documents of published scientific articles and ChatGPT-generated documents.The algorithmic steps involved in this process are explained in the subsequent sections.

ChatGPT Prompt Engineering for Article Generation
We utilized the ChatGPT API (version: 3.5, model: "gpt-3.5-turbo-16k") to engineer prompts for generating simulated articles in the subject areas of real published articles (Alzheimer's, cancer, and depression).These prompts were parameterized using information from the real articles (search keywords used for article retrieval and the average number of words) to make them comparable to real abstracts.They included three key elements: (a) the role of the prompt: "a biomedical researcher," (b) the request example: "Generate a list of 20 simulated PubMed-style abstracts," (c) topic example: "the Alzheimer's disease," and (d) specifications: each article must contain ID, Title, and Abstract fields.We also instructed prompts to generate a valid JSON response with these specifications.The number of words helped offset any bias and made the fake articles comparable to the precise level of detail required (the 200-250 words range is a common requirement by many prominent biomedical informatics journals).Table 1 captures the search queries and the number of fake articles generated in the JSON format.Total number of generated articles 3952 The prompt-engineering process is computationally described in Algorithm 1.Although this process was done programmatically, due to the timeout limit, we executed it to produce 20 simulated articles at a time.This prompt-engineering approach enabled us to generate a large corpus of simulated articles that closely resembled real scientific publications in terms of structure, content, and overall style.This dataset played a crucial role in training the xFakeSci algorithm, enabling it to accurately distinguish between real scientific articles and machine-generated ones.
for For each disease d in diseases do [System Role: ] You are a biomedical researcher specialized in studying d disease.
[Request Content:] Generate a list of article_number simulated PubMed-style abstract.
[Request Details:] Provide disease and co-mobridities detailed information.
[Response Format:] A valid JSON format returned as an array of valid JSON records.end for Return an array list of ChatGPT-generated article in JSON format.

Prediction of Fake Articles using xFakeSci
The xFakeSci algorithm is a network-driven label prediction algorithm designed to distinguish between real scientific articles and machine-generated ones.This entails that the algorithm has two main tasks: training the model and testing it to detect the label of entirely new documents that have never been seen before.In this section, we introduce the computational steps that describe the prediction process, starting with (1) the construction of network training models, (2) the calibration of the algorithm, and (3) the label prediction for each of the ChatGPT-generated articles.

Model Derivation and Network Construction
Since our training model is network driven from text, we used the Term Frequency-Inverse Document Frequency (TF-IDF) [40][41][42][43][44][45] to extract word features as building blocks of the training models.The TF-IDF algorithm can be configured to generate two consecutive words (known as bigrams) that may prove significant across an entire dataset 46,47 .Equation 2shows the mathematical representation of the TF-IDF.where tf(t, d) is the frequency of bigram t in document d,and idf(t, D) is the inverse document frequency of bigram t in the document set D.
The term frequency of a term t is calculated as the ratio between the number of occurrences of the term divided by the total number of terms in a document d as show in Equation 3. The inverse document frequency (idf(t, D)) is calculated as in Equation 4where N is the total number of documents in the collection and df(t, D) is the number of documents containing the bigram t.
To construct a training model, we extracted bigrams to form a network model as follows: the individual words of a bigram served as nodes, and edges represented the relationship between the bigrams.To illustrate the utility of bigrams in constructing the training model, let's consider a scenario concerning the "depression disease": bigrams such as "mental health," "health condition," and "condition worsen" form connections based on the common words they share, enabling a network that can be analyzed for various purposes.Using this mechanism, we constructed two distinct training models: one from the abstracts of published literature (labeled as the "PUBMED" class), and another from ChatGPT-generated text (labeled as the "GPT" class).To ensure fairness and prevent biases, both models were constructed from the same number of documents (100 abstracts and 100 ChatGPT-generated).Both datasets were processed using identical series of steps, including removing stopwords and sentence tokenization.
Algorithm 2 outlines the steps involved in building the network model from bigrams.We applied this algorithm twice, once to create a publication training model and another to create a ChatGPT model.We recorded the corresponding statistics (numbers of nodes and edges) for each model in Table 3.The initial observations revealed a consistent pattern where models constructed from ChatGPT-generated text exhibited the lowest number of nodes, yet they also maintained the highest number of edges.The resulting models showed disconnected components and fragmented communities, requiring pruning.This need was satisfied by applying the Largest Connected Components (LCC) algorithm 48 , which ensured that the resulting networks maintained high connectivity.The LCC presents an admissible pruning heuristic due to the presence of high-degree nodes that promote network stability and robustness [49][50][51][52] .

Evaluating the Premise of ChatGPT's Distinctive Behavior
As mentioned earlier, the training models are constructed from the first 100 articles from each dataset.To test the premise of how ChatGPT may exhibit distinctive behavior, we divided the remaining articles into k-Folds, each containing 100 articles.The main idea of such a test was to measure the impact of each fold on the corresponding training model, specifically how the bigrams extracted from each of the folds altered the Largest Connected Components (LCCs) of their respective data types.
For the Alzheimer's disease dataset, we constructed three training models: the impact of bigrams was determined by calculating the mean ratio between the number of bigrams contributing to the LCC and the total number of words in each article R.append(doc_ratio) ▷ Keep track of the ratios 13: end for 14: fold_mean ← compute the mean of the ratios mean(R) 15: return fold_mean within a fold.This process is captured by measuring the average contribution rate of the bigrams of a given fold.Algorithm 3 provides the pseudocode for this step.
We summarized the analysis of each dataset and disease in Table 4.The initial observations indicated that the ChatGPT ratios fluctuated between 27% and 30% for the Alzheimer's disease, 27% and 29% for cancer, and 28% and 32% for depression.In contrast, the ratios derived from scientific articles ranged between 14% and 16% for the Alzheimer's disease, 14% and 17% for cancer, and 9% and 11% for depression.The full analysis of these results will be discussed in the "Results" section.The analysis demonstrated that the ratios of ChatGPT-generated documents were significantly different from those computed from scientific articles.This distinction serves as proof that the premise is indeed true.Furthermore, this knowledge provides lower and upper bounds for each disease, offering more guidance to the algorithm to predict the label while avoiding issues of overfitting.Algorithm 4 demonstrates how the ratios are computed.For brevity, we only demonstrate for the depression disease.Table 5 presents the corresponding lower and upper bounds, which are necessary during the calibration phase.

Label Prediction of Articles: Real vs Fake
Testing the premise in the above section demonstrated the fundamental differences in content behavior between fake ChatGPT articles and real publications.In this section, we present the learning algorithm, which is the main contribution of this paper.During the Coronavirus global pandemic, our previous work addressed the challenge of detecting fake news and science as an emerging infodemic 15 .However, this work was limited by the lack of comprehensive machine-generated datasets that could adequately assess performance in the presence of fake data.Now, with the advent of ChatGPT and generative AI technologies, we can generate diverse datasets using prompt-engineering algorithms as demonstrated above.Additionally, the previous work did not make use of data-driven insights, which we incorporate as a calibration step.This is an intermediate phase that takes place after the training phase and before the label prediction phase.Due to these factors, the previous work was limited to a single-mode label prediction using a single type of dataset.Therefore, it was necessary to split the dataset into training and test sets.The following is a complete comparison of the previous work and the current features presented by xFakeSci, such as content type, configuration parameters, classification mode, calibration, and classification, as presented in Table 2.
As shown in Table 2, the xFakeSci algorithm is particularly designed to address multi-mode classification.Therefore, it is expected to train the algorithm using two or more independent types of data.Consequently, the algorithm also expects a hybrid test set of mixed types and will produce more accurate labels for each type.However, such modes suffer from what is known as the "overfitting" issue [53][54][55][56] .The introduction of the calibration step (by calculating the lower/upper bound ratios captured in Table 5) was to guide the decision of the final label prediction and avoid such an issue.The table demonstrates a clear separation of lower/upper bound ratios.Therefore, we further utilize such a mechanism by incorporating a calibrating step to further guide the classification process without having to train the algorithm with too many samples.The algorithmic steps for the calibration process are explained in Algorithm 4. Though the ranges provide an extra net for predicting the label, it is also possible that some document instances may fall outside the specified ranges of the datasets, which could result in not Input: List of ChatGPT k-Fold ratios R: [r 0 , r 1 , r 2 , ..., r k ] 3: Output: Dictionary of key/value pairs where the key is the name of the dataset, and the value is the lower/upper bound range.

5:
gpt_ranges ← [] for each ratio in R do for each ratio in R ′ do 14: pubmed_lower_upper ← range(pubmed_min, pubmed_max) predicting a label correctly.Therefore, we introduced a proximity heuristic that favors the shortest distance to the ranges driven from the individual datasets (real or ChatGPT-generated) and assigns a label accordingly.Equations 5 and 6 demonstrate how the distance is calculated.
Algorithm 5 illustrates the computational steps for multi-mode execution, demonstrating the complexity involved, including the proximity distance.To use the algorithm in detecting fake science, it must be trained using two different types of data: (1) a real publication dataset and (2) ChatGPT-generated articles.The algorithm also expects the ratio means of each data source, which are computed using the calibration algorithm.

Discussion
In a world where generative AI has become widespread, various studies aimed to investigate the potential issues of using ChatGPT to generate fake science.The literature review showed a desperate need to advance the algorithmic approaches to discern real publications from fake ones, especially, when they are mixed.Our study aimed to address such issues incrementally.dist range 1 ← compute_distance_to_range(ratio 1 , range 1 ) assign GPT as a label to article d capture stats ← score prediction 21: end for 22: return stats Specifically, we first tested the intuition of whether the content generated by ChatGPT may exhibit unique characteristics that distinguish it from real science.We explored this task using prompt engineering, where we created engineered datasets on the subjects of Alzheimer's, cancer, and depression diseases.In this work, we contributed a prompt-engineering algorithm on how to generate simulated content to evaluate this premise.Working with plain text (using publication abstracts or generated from ChatGPT), using the TF-IDF algorithm is a common approach to generate bigrams that can be used to construct more complex models.
Our initial observation of networks generated from ChatGPT content is that they are highly connected and contain fewer nodes compared to networks constructed from real publication text.Additionally, when we calculated the ratios of the number of bigrams against the total number of words of documents on k-Folds, we found that the ratios of ChatGPT content are much higher than scientific abstracts.These two indications supported our intuition that ChatGPT documents exhibit distinguishable behavior than PubMed abstracts.One interpretation of this observation could be due to the inherent design of the ChatGPT engine.As observed, ChatGPT is optimized to generate highly convincing content by predicting the next correlated terms statistically using a Large Language Model.On the other hand, scientists prioritize accurate documentation of hypothese testing, scientific experiments, and careful explanation of observations.Describing science in terms of highly correlated words is not a goal of scientists.Clearly, the difference in goals may contribute to less connectivity in scientific publications.
Further, we introduced the xFakeSci algorithm, a learning algorithm that predicts a label for a given article.In the Methods section, we showed that it is designed to operate in two modes: (1) Single-mode: where only one type of articles from the same source is used for training and a new set of documents from the same pool is used for predicting the label of an article; and (2) Multi-mode: where the algorithm was trained from two sources and a hybrid train model (of real and generated datasets) was constructed to make the predictions.The single-model is trivial; therefore, we focused our experiments on demonstrating the multi-mode.We performed several experiments to do the following: (1) to test and measure how the xFakeSci algorithm predicts labels of ChatGPT generated documents for a given disease when mixed with scientific abstracts, (2) to evaluate whether the algorithm performs consistently using various datasets of different diseases not only one disease, (3) to test whether the year of the publication plays a role in predicting ChatGPT generated documents when mixed with publications from various periods (2020-2024, 2015-2019, 2010-2014), and (4) to benchmark the algorithm against a baseline of some of the most common data mining algorithms.Our results for each experiment used the TP, TN, FP, FN metrics and F1 scores.
When testing whether the year of publication plays a role in label prediction, we observed F1 scores of 91%, 92%, and 94% for cancer-related publications across different periods.This suggests a pattern of better detection of ChatGPT articles when mixed with older publications.However, identifying newer publications proved more challenging.For the Alzheimer's disease, while no improvement was observed, degradation was also absent.As mentioned earlier, the Alzheimer's datasets were the smallest among all datasets, limiting the calibration process due to fewer k-Folds compared to other diseases.In the case of depression, the algorithm exhibited consistent performance with an F1 score of 80% across all periods.It's plausible that mental health data acquisition posed limitations, potentially constraining resources from this specific area.Testing this hypothesis involves measuring document similarity between PubMed and ChatGPT sources using lexical and semantic analysis.
Upon benchmarking xFakeSci against classical data mining algorithms, we observed an interesting pattern: xFakeSci correctly predicted all the scientific publications in all the experiments we performed.However, other algorithms misclassified publications as ChatGPT and vice versa (true positives, false positives, false negatives, and true negatives).xFakeSci, however, needed improvement in predicting true negatives (ChatGPT documents), as many ChatGPT documents were labeled as true positives (real publications).In all the experiments, the F1 scores of xFakeSci ranged between 80% and 94%.In contrast, the other data mining algorithms showed much lower performance, with F1 scores ranging between 32% and 52%.We attribute the high performance of xFakeSci to the calibration process, which was guided by ratios and proximity distances.Although the training model remained lightweight, both heuristics provided more guidance for predicting fake articles.This novel calibration method benefits from an abundance of data, without suffering from overfitting issues like other common classification algorithms.Clearly, the xFakeSci algorithm does not suffer from such a deficiency in identifying real articles when mixed with ChatGPT-generated content While xFakeSci is designed to distinguish fake science from real, it can be applied to various types of text data, including clinical notes, clinical trial summaries, and interventions.With the widespread adoption of generative AI tools such as ChatGPT and Google Bard, ethical concerns may arise, such as clinicians using ChatGPT to generate clinical notes, potentially resulting in erroneous entries with serious consequences.In such cases, our algorithm may serve as a forensic tool to identify potentially fake portions of these reports.
While we have highlighted the potential for harm posed by ChatGPT and similar tools, it is also important to recognize their positive generative capabilities.For instance, ChatGPT played a crucial role in providing our algorithm with simulated data, which was essential for our work during the global pandemic in detecting fake news and publications 15 .Moreover, ChatGPT can generate code snippets as building blocks for various basic tasks, including data visualization, across diverse programming languages.We are currently exploring this capability to construct workflows for life sciences applications.Additionally, the ChatGPT engine can effectively convert semi-structured content into popular formats like JSON, XML, and others.While these capabilities are undoubtedly useful, they necessitate the development of ethical standards to ensure responsible use of such tools.
Another intriguing potential use is that, when creatively engineered, ChatGPT could function as a valuable teaching assistant for academics and school teachers.It could potentially generate various ways to present questions while maintaining the integrity of the original content.Furthermore, ChatGPT could revolutionize scientific writing by providing support in addressing grammatical errors, typography, and paraphrasing, particularly for those whose native language is not English 57 .

Conclusions and Future Directions
When we asked a high school student about their knowledge of ChatGPT, they responded, "Do you mean that tool that does my homework for you?" Indeed, ChatGPT is an incredibly sophisticated tool with a wide range of impressive capabilities.Since the rise of ChatGPT, many new research topics have opened a new generative door, and many long-standing questions are now being investigated.However, the most significant concern associated with ChatGPT and other generative AI tools is that they could pose a threat to the future of science.If younger generations utilize ChatGPT to plagiarize, it could undermine the integrity of research and learning, potentially having a negative impact on the development of future pioneers.
While learning algorithms, such as xFakeSci, can assist in identifying fake science, there is an ethical obligation to use generative AI tools responsibly and regulate their usage 16 .It is worth noting that certain countries, such as Italy, have taken the extreme step of banning ChatGPT.While the authors believe such measures may be drastic, addressing ethical concerns is a new frontier that must be tackled.As ChatGPT itself states, "It is up to individuals and organizations to use technology like mine in ways that promote positive outcomes and minimize any potential negative impacts."As advised by Anderson et al., it is also the responsibility of publishers and those involved in the production of science to play a proactive role in promoting good science.This includes raising awareness of the importance of implementing advanced fake science detection algorithms, including ours, and activating the use of technologies to distinguish fake research and fabricated findings 35 .
Looking ahead, there are several avenues for future research based on our current work: (1) conducting a preprocessing step (e.g., clustering) to group more closely related publications together (e.g., breast cancer, prostate cancer, and others), or separate diseases from co-morbidities.The use of knowledge graphs may be a powerful tool to use in continuing to investigate this research direction; (2) further experimentation in training and calibrating the xFakeSci algorithm by utilizing heuristics learned from preprocessing steps and the discoveries of clusters; and (3) testing the algorithm on more than two data sources (clinical reports, publications, and ChatGPT-generated documents).

Algorithm 2 4 :
Network Construction from Bigrams Computed from Text Documents (PubMed, and ChatGPT) Require: D: [document dataset (PubMed or ChatGPT))] T : training graph B: empty list of bigrams Ensure: G : fully populated graph 1: for each document d in D dataset (either from ChatGPT or PubMed) do 2: B ← Compute Term Frequency -Inverse Term Frequency tf-idf(d) 3: for each bigram ⌊ in B encoded as source and target b(s,t) do Form an edge e(s,t) for the unigram constituents of the bigram b.

5 : 6 :
if e does not exist in the training graph: T then Add e to the training model T .end for 11: return T a fully populated graph from dataset bigrams

6 / 18 Algorithm 3 for each bigram b in B do 7 :
Compute the Mean of Ratios for a Corresponding Fold Require: Training model M, Require: One fold of 100 articles from k-folds, Require: Computed ratios of each fold R ← [] ▷ Initialize as an empty list Ensure: Fold mean 1: for each document d in D do if b contributes an edge to the corresponding training model M then

8 / 18 Algorithm 5 6 : 7 : 9 :
Multi-Mode Execution of the xFakeBib Algorithm 1: Input: [ChatGPT Model G, PubMedModel P] 2: Input: [ChatGPT − ratios range 1 , PubMed − ratios range 2 ] 3: Input: a dataset D of mixed fake and real documents 4: for each document d in D dataset do 5: ratio 1 ← compute_model_contribution(d, G) ratio 2 ← compute_model_contribution(d, P) if ratio 1 in range 1 then 8: assign GPT as a label to document d else if ratio 2 in range 2 then 10:assign PubMed as a label to document d

Figure 2 .
Figure 2. Comparison of Calibrating Ratio Means for the Cancer Disease.

Figure 3 .
Figure 3.Comparison of Calibrating Ratio Means for the Depression Disease.

Table 1 .
Result Counts of Prompt Engineering

Table 2 .
Comparison between NeoNet and xFakeSci Algorithms

16 :
else if dist range 2 < dist range 1 then

Table 3 .
Phase I Premise Testing: Summary of Nodes and Edges for Training Models from Different Sources: PubMed vs ChatGPT Datasets

Table 4 .
Phase II Premise Testing: Summary of Ratio Means for Different Diseases and Datasets

Table 5 .
Model Calibration: Summary of Lower and Upper Bound Ranges for Different Diseases and Datasets by Year Periods: 2010-2014, 2015-2019, 2020-2024

Table 7 .
Multi-Mode Experiments: F1 Classification Scores From Most Recent Publications

Table 8 .
Multi-Mode Experiments: F1 Classification Scores By Older Periods